Abstract

For a long time, person re-identification and image search
have been studied as two separate tasks. However, for person
re-identification, the effectiveness of local features and the
“query-search” mode make it well suited to image search
techniques.

In the light of recent advances in image search, this paper
proposes to treat person re-identification as an image
search problem. Specifically, this paper claims two major
contributions. 1) By designing an unsupervised Bag-of-Words
representation, we bridge the gap between the two tasks by
integrating techniques from image search into person
re-identification. We show that our system sets up an
effective yet efficient baseline that is amenable to
further supervised/unsupervised improvements. 2) We
contribute a new high-quality dataset which uses the DPM
detector and includes a number of distractor images. Our
dataset reaches closer to realistic settings, and new
perspectives are provided.

Compared with approaches that rely on feature-feature
matching, our method is faster by over two orders of
magnitude. Moreover, on three datasets, we report competitive
results compared with the state-of-the-art methods.

1. Introduction 

This paper considers the task of person re-identification.
Given a probe image (query), our task is to search in a
gallery (database) for images that contain the same person.
Person re-identification has important applications
in video surveillance, e.g., cross-camera visual tracking,
multi-camera event detection, etc. This task remains
an unsolved problem, due to the difficulty in visual
matching caused by the extensive variations in illumination,
viewpoint, pose, photometric settings of cameras, low
resolution, background, etc.

* Three authors contribute equally to this work. 


Our work is motivated by two aspects. First, local feature
based approaches have proven effective in person
re-identification. Considering the “query-search” mode, this
is potentially compatible with image search based on the
Bag-of-Words (BoW) model. Nevertheless, some state-of-the-art
methods in person re-identification rely on brute-force
feature-feature matching [41, 40]. Although a good
recognition rate is achieved, this line of methods suffers
from low computational efficiency, which limits its potential
in large-scale applications. In the BoW model, local features
are quantized to visual words using a pre-trained codebook.
An image is thus represented by a visual word histogram
weighted by the TF-IDF scheme. Instead of performing
exhaustive visual matching among images, the BoW model
aggregates local features into a global vector. To tackle
spatial constraints, a number of geometry-aware visual
matching methods [39, 22] have been proposed. Moreover, to
further boost search accuracy, it is beneficial to include
some post-processing steps.

Second, most existing person re-identification datasets
are flawed either in dataset scale or in data richness.
Specifically, the number of identities is often confined to
several hundred, which may lead to unstable performance.
Moreover, images of the same identity are usually captured by
two cameras; each identity typically has one image under each
camera, so the number of queries and relevant images is very
limited. Furthermore, in most datasets, pedestrians are
well aligned by hand-drawn bounding boxes. But in reality,
when pedestrian detectors are employed, the detected persons
may undergo misalignment or missing body parts (see Fig. 1).
On the other hand, pedestrian detectors, while producing true
positive bounding boxes, also yield false alarms caused by
complex background or occlusion (see also Fig. 1). In fact,
these distractor images may exert a non-negligible influence
on recognition accuracy. As a result, current methods may
be biased toward more ideal settings, and their effectiveness
may be impaired once the ideal dataset meets reality. To
address this problem, it is important to introduce datasets
that reach closer to realistic settings and to design robust
algorithms which can handle detector errors and are not
affected by distractors.

Figure 1. Sample images of the Market-1501 dataset. All images are normalized to 128x64. (Top:) Sample images of three identities with
distinctive appearance. (Middle:) We show three cases where three individuals have very similar appearance. (Bottom:) Some samples of
the distractor images (left) as well as the junk images (right) are provided.

Considering the above two issues, this paper makes two
major contributions. First, inspired by the state-of-the-art
image search methods, an unsupervised BoW representation
is proposed. After generating a codebook on the training
data, each pedestrian image is represented as a visual
word histogram. In this step, a number of techniques are
integrated, e.g., root descriptor, negative evidences,
burstiness weighting [16], etc. To incorporate geometric
constraints, images are partitioned into horizontal
stripes. Moreover, multiple queries are pooled into one
vector, which adapts to the extensive image variations.
Finally, an automatic reranking step is added to refine the
initial rank list. Using a simple dot product as the
similarity measurement, we show that the proposed feature
representation yields competitive recognition accuracy while
enjoying a fast response time.

Second, a new person re-identification dataset, called
“Market-1501”, is introduced (Fig. 1). This dataset is
composed of 1501 identities collected by 6 cameras near the
entrance of a university campus supermarket. To the best
of our knowledge, Market-1501 is the largest person
re-identification dataset, featuring 32643 annotated bounding
boxes. It is distinguished from existing datasets in three
aspects: DPM-detected bounding boxes, the inclusion of
distractor images, and multiple queries and multiple
groundtruths per identity. The Market-1501 dataset provides
a more realistic benchmark for algorithm evaluation.

The rest of this paper is organized as follows. After a
brief review of related works in Section 2, we describe the
Market-1501 dataset in Section 3. Then, Section 4 introduces
the proposed method based on image search techniques.
Experimental results are summarized in Section 5,
and conclusions and insights are given in Section 6.


2. Related Work 

For person re-identification, both supervised and
unsupervised models have been extensively studied in recent
years. In discriminative models [30, 5, 23], classic SVM
(or RankSVM [30, 42]) and boosting [9] are popular
choices. For example, Zhao et al. [42] learn the weights of
filter responses and patch matching scores using RankSVM,
and Gray et al. perform feature selection among an
ensemble of local descriptors by boosting. Recently, Li et al.
[23] propose a deep learning network to jointly optimize all
pipeline steps. This line of work is beneficial in reducing
the impact of multi-view variations, but requires laborious
annotation, especially when new cameras are added to the
system. On the other hand, in unsupervised models,
Farenzena et al. make use of both the symmetric and
asymmetric nature of pedestrians and propose the
Symmetry-Driven Accumulation of Local Features (SDALF).
Ma et al. employ the Fisher Vector to encode local features
into a global vector. To exploit the salience information
among pedestrian images, Zhao et al. [40] propose to assign
higher weight to rare colors, an idea very similar to the
Inverse Document Frequency (IDF) in image search. In this
scenario, this paper proposes an unsupervised method which
requires a minimal amount of labeling or training effort.

On the other hand, the field of image search has been
greatly advanced since the introduction of the SIFT
descriptor and the BoW model. In the last decade, a myriad
of methods [44, 36, 39] have been developed to improve
search performance. For example, to improve matching
precision, Jegou et al. [15] embed binary SIFT features
in the inverted file. Meanwhile, refined visual matching can
also be produced by index-level feature fusion [44, 36]
between complementary descriptors. Since the BoW model
does not consider the spatial distribution of local features
(also a problem in person re-identification), another
direction is to model the spatial constraints [47, 39, 13].
The geometry-preserving visual phrases (GVP) [39] and the
spatial coding methods both calculate the relative position
among features, and check the geometric consistency
between images by the offset maps. Zhang et al. [38]
propose to use descriptive visual phrases to build pairwise
constraints, and Liu et al. encode geometric cues into
binary features embedded in the inverted file.

| Datasets      | Market-1501 | RAiD  | CUHK03 | VIPeR | i-LIDS | CUHK01 | CUHK02 | CAVIAR |
|---------------|-------------|-------|--------|-------|--------|--------|--------|--------|
| # identities  | 1,501       | 43    | 1,360  | 632   | 119    | 971    | 1,816  | 72     |
| # BBoxes      | 32,643      | 6,920 | 13,164 | 1,264 | 476    | 1,942  | 7,264  | 610    |
| # distractors | 2,793       | 0     | 0      | 0     | 0      | 0      | 0      | 0      |
| # cam. per ID | 6           | 4     | 2      | 2     | 2      | 2      | 2      | 2      |
| DPM or Hand   | DPM         | hand  | DPM    | hand  | hand   | hand   | hand   | hand   |
| Evaluation    | mAP         | CMC   | CMC    | CMC   | CMC    | CMC    | CMC    | CMC    |

Table 1. Comparing Market-1501 with existing datasets [23, 8, 45, 22, 21, 2].

For ranking problems, an effective reranking step
typically brings about improvements. Liu et al. [24] design
a “one shot” feedback optimization scheme which allows
a user to quickly refine the search results. Although it is
shown to yield consistent improvement, reranking based
on user feedback is not always desirable or accessible. In
rigid object search, RANSAC is typically used in
post-processing. In [33], the top-ranked images are used as
queries again, and the final score is the weighted sum of
individual scores. When multiple queries are present [1], a
new query which integrates the original queries can be formed
by average or max operations.

3. The Market-1501 Dataset 
3.1. Description 

In this paper, a new person re-identification dataset, the
“Market-1501” dataset, is introduced. During dataset
collection, a total of six cameras were placed in front of a
campus supermarket, including five 1280x1080 HD cameras
and one 720x576 SD camera. The fields of view of these
cameras overlap. This dataset contains 32643 bounding boxes
of 1501 identities. Since it is an open environment,
images of each identity are captured by at most six cameras.
We make sure that each annotated identity is captured by
at least two cameras, so that cross-camera search can be
performed. In fact, even within the same camera, images of
the same identity still take on distinct appearance. Overall,
the Market-1501 dataset has the following featured properties.

First, while most existing datasets use hand-cropped
bounding boxes, the Market-1501 dataset employs a
state-of-the-art detector, i.e., the Deformable Part Model
(DPM). As shown in Fig. 1, misalignment as well as missing
body parts are very common among the detected images.

Figure 2. A toy example of the difference between AP and CMC
measurements. True matches and false matches are in green and
red boxes, respectively. For all three rank lists, the CMC curve
remains 1. But AP = 1, 1, and 0.41, respectively.

Second, in addition to the true positive bounding boxes,
we also provide false alarm detection results. We notice that
the CUHK03 dataset [23] also uses the DPM detector, but
the bounding boxes in CUHK03 are relatively good ones
in terms of the DPM detector. In fact, a large number of the
detected bounding boxes would be very “bad”. Considering
this, for each detected bounding box to be annotated, a
hand-drawn groundtruth bounding box is provided (similar
to [23]). Different from [23], for the detected and
hand-drawn bounding boxes, the ratio of the overlapping area
to the union area is calculated. In our dataset, if the area
ratio is larger than 50%, the DPM bounding box is marked
as “good” (a routine in object detection); if the ratio
is smaller than 20%, the DPM bounding box is marked
as “distractor”; otherwise, the bounding box is marked as
“junk” [29], meaning that this image has zero influence
on the re-identification accuracy. Moreover, some obvious
false alarm bounding boxes on the background are also
marked as “distractors”. In Fig. 1, examples of “good”
images are shown in the top two rows, while “distractor” and
“junk” images are in the bottom row.
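
The labeling rule described above can be sketched in a few lines. This is a minimal illustration, not the annotation tool used for the dataset: the function names `iou` and `label_detection` and the `(x1, y1, x2, y2)` box format are assumptions made here, while the 50% and 20% thresholds come from the text.

```python
def iou(box_a, box_b):
    """Ratio of overlapping area to union area for boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def label_detection(detected, hand_drawn, good_thr=0.5, distractor_thr=0.2):
    """Label a detected box against its hand-drawn groundtruth box.

    Thresholds follow the text: ratio > 50% is "good", ratio < 20% is
    "distractor", anything in between is "junk".
    """
    ratio = iou(detected, hand_drawn)
    if ratio > good_thr:
        return "good"
    if ratio < distractor_thr:
        return "distractor"
    return "junk"


if __name__ == "__main__":
    truth = (0, 0, 10, 20)
    for det in [(0, 0, 10, 20), (5, 0, 15, 20), (9, 0, 19, 20)]:
        print(det, label_detection(det, truth))
```

In this toy run, a box shifted by half its width has an overlap ratio of 1/3 and therefore falls into the “junk” band rather than being counted as either a true match or a distractor.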

Third, each identity may have multiple images under
each camera. Therefore, during cross-camera search, there
may be multiple queries and multiple groundtruths for each
identity. A comparison with existing datasets is shown in
Table 1.

3.2. Evaluation Protocol 

Current datasets typically use the Cumulated Matching
Characteristics (CMC) curve to evaluate the performance
of re-identification algorithms. The CMC curve shows the
probability that a query identity appears in candidate lists
of different sizes. This evaluation measurement is valid only
if there is only one groundtruth match for a given query (see
Fig. 2(a)). In this case, precision and recall are the same
issue. However, if multiple groundtruths exist, the CMC
curve is biased because “recall” is not considered. For
example, the CMC curves of Fig. 2(b) and Fig. 2(c) both equal
1, which fails to provide a fair comparison of the quality
of the two rank lists.

Figure 3. Sample query images. In the Market-1501 dataset, queries
are hand-drawn bounding boxes. Each identity has at most 6
queries, one for each camera.

Figure 4. Local feature extraction. We compute the mean CN
vector for each 4x4 patch. These local features are quantized, and
then pooled in a histogram for each horizontal stripe.

For the Market-1501 dataset, there are on average 14.8
cross-camera groundtruths for each query. Therefore, we
use mean average precision (mAP) to evaluate the overall
performance. For each query, we calculate the area under
the Precision-Recall curve, which is known as average
precision (AP). Then, the mean value of the APs of all
queries, i.e., mAP, is calculated, which considers both
precision and recall of an algorithm, thus providing a more
comprehensive evaluation.
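
To make the contrast between CMC and AP concrete, the sketch below scores two toy rank lists. It assumes one common discrete form of AP (the mean of the precision values at the ranks of the true matches); the exact integration of the Precision-Recall curve used in the released evaluation code may differ. The function names and the 0/1 relevance encoding are illustrative choices, and the rank lists are made up rather than taken from Fig. 2.

```python
def average_precision(relevance):
    """AP of one rank list.

    `relevance` holds 1 for a true match and 0 for a false match, in rank
    order, with junk images assumed to be removed beforehand.
    """
    num_relevant = sum(relevance)
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / num_relevant


def cmc_at_k(relevance, k):
    """1 if any true match appears within the top-k positions, else 0."""
    return 1.0 if any(relevance[:k]) else 0.0


def mean_average_precision(rank_lists):
    """Mean of the per-query APs."""
    return sum(average_precision(r) for r in rank_lists) / len(rank_lists)


if __name__ == "__main__":
    compact = [1, 1, 0, 0, 0]    # both true matches ranked first
    scattered = [1, 0, 0, 0, 1]  # second true match ranked last
    for name, rel in [("compact", compact), ("scattered", scattered)]:
        print(name, cmc_at_k(rel, 1), average_precision(rel))
```

Both lists have a rank-1 CMC value of 1, yet their APs differ (1.0 versus 0.7), which is the kind of difference the CMC curve alone cannot express when a query has several groundtruths.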

Our dataset is randomly divided into training and testing
sets, containing 750 and 751 identities, respectively. During
testing, for each identity, we select one query image in each
camera. Note that the selected queries are hand-drawn,
instead of DPM-detected as in the gallery. The reason is
that in reality, it is very convenient to interactively draw a
bounding box, which can yield higher recognition accuracy.
The search process is performed in a cross-camera
mode, i.e., relevant images captured in the same camera as
the query are viewed as “junk”. In this scenario, an identity
has at most 6 queries, and there are 3363 query images in
total. On average, there are 4.5 query images for each
identity, and each query has 14.8 groundtruth images. Queries
of two sample identities are shown in Fig. 3.


4. Our Method 

4.1. The Bag-of-Words Model 

For three reasons, we adopt the Bag-of-Words (BoW)
model. First, this model accommodates local features well,
which previous works [27, 40] indicate are effective.
Second, it enables fast global feature matching, instead of
exhaustive feature-feature matching [42]. Third, by
quantizing similar local descriptors to the same visual word,
the BoW model achieves some invariance to illumination,
view, etc. We describe the individual steps below.

Feature Extraction. As a baseline, we employ the Color
Names (CN) descriptor. Given a pedestrian image normalized
to 128x64 pixels, patches of size 4x4 are densely
sampled. In our experiment, the sampling step is 4, so there
is no overlap between patches. For each patch, CN
descriptors of all pixels are calculated, and are subsequently
$\ell_1$-normalized followed by the $\sqrt{(\cdot)}$ operator
[32]. The mean vector is taken as the descriptor of this
patch (see Fig. 4).
Codebook. For the Market-1501 dataset, we generate a codebook
on the training set. For other datasets, the codebook
is trained on the independent TUD-Brussels dataset.
Standard k-means is used, and the codebook size is $k$.
Quantization. Given a local descriptor, we employ Multiple
Assignment (MA) [15] to find its near neighbors under
Euclidean distance in the codebook. We set MA = 10, so a
feature is represented by the indices of 10 visual words.
TF-IDF. The visual word histogram is weighted by the TF-IDF
scheme. TF encodes the number of occurrences of a visual
word, and IDF is calculated as $\log \frac{N}{n_i}$, where
$N$ is the number of images in the gallery, and $n_i$ is the
number of images containing visual word $i$. We use the
avgIDF variant in place of the standard IDF.
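
The quantization and weighting steps can be sketched as follows. This is a simplified illustration under stated assumptions: it uses the standard IDF $\log(N/n_i)$ rather than the avgIDF variant the paper adopts, the codebook is taken as given (no k-means training), and all function names are hypothetical.

```python
import math


def nearest_words(descriptor, codebook, ma=10):
    """Multiple Assignment: indices of the `ma` closest visual words
    under (squared) Euclidean distance."""
    dists = [
        (sum((d - c) ** 2 for d, c in zip(descriptor, center)), idx)
        for idx, center in enumerate(codebook)
    ]
    dists.sort()
    return [idx for _, idx in dists[:ma]]


def tf_histogram(descriptors, codebook, ma=10):
    """Term-frequency histogram: each descriptor votes for `ma` words."""
    tf = [0.0] * len(codebook)
    for desc in descriptors:
        for idx in nearest_words(desc, codebook, ma):
            tf[idx] += 1.0
    return tf


def idf_weights(gallery_tfs):
    """Standard IDF log(N / n_i); words absent from the gallery get 0."""
    n_images = len(gallery_tfs)
    weights = []
    for i in range(len(gallery_tfs[0])):
        n_i = sum(1 for tf in gallery_tfs if tf[i] > 0)
        weights.append(math.log(n_images / n_i) if n_i > 0 else 0.0)
    return weights


def tfidf(tf, idf):
    """Element-wise TF-IDF weighting of one histogram."""
    return [t * w for t, w in zip(tf, idf)]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
    tf = tf_histogram([[0.9, 0.9], [5.0, 5.0]], codebook, ma=1)
    idf = idf_weights([[1.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
    print(tf, tfidf(tf, idf))
```

A brute-force nearest-word search is used here for clarity; with a codebook of a few hundred words this is cheap, and it keeps the example free of external dependencies.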

Burstiness. Burstiness refers to the phenomenon where a
query feature finds multiple matches in a test image [16].
For the CN descriptor, burstiness could be more prevalent due
to its low discriminative power compared with SIFT.
Therefore, all terms in the histogram are divided by
$\sqrt{\mathrm{tf}}$.
Negative Evidence. Following [14], we calculate the mean
feature vector in the training set. Then, the mean vector is
subtracted from all test features. In this way, the zero entries
in the feature vector are also taken into account by the dot
product.

Similarity Function. Given a query image $Q$ and a gallery
image $G$, the corresponding $\ell_2$-normalized feature
vectors are denoted as $(q_1, q_2, \ldots, q_k)^T$ and
$(g_1, g_2, \ldots, g_k)^T$, respectively, where $k$ is the
codebook size. The similarity function is written as

$$\mathrm{sim}(Q, G) = \sum_{i=1}^{k} q_i \cdot g_i, \quad (1)$$

which calculates the dot product between two vectors.
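
The remaining steps of the representation (burstiness weighting, negative evidence, normalization, and Eq. 1) can be sketched as below. The order in which the steps are chained and the function names are assumptions made for illustration; the text specifies each operation but not a reference implementation.

```python
import math


def burstiness(weighted_hist, tf):
    """Divide each histogram term by sqrt(tf); empty bins stay at zero."""
    return [v / math.sqrt(t) if t > 0 else 0.0
            for v, t in zip(weighted_hist, tf)]


def subtract_mean(vec, mean_vec):
    """Negative evidence: subtract the training-set mean feature vector."""
    return [v - m for v, m in zip(vec, mean_vec)]


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def similarity(q, g):
    """Eq. 1: dot product between two feature vectors."""
    return sum(a * b for a, b in zip(q, g))


if __name__ == "__main__":
    tf = [4.0, 0.0, 1.0]
    hist = burstiness(tf, tf)  # here the weighted histogram equals tf
    mean_vec = [1.0, 1.0, 1.0]  # made-up training mean
    q = l2_normalize(subtract_mean(hist, mean_vec))
    print(q, similarity(q, q))
```

After mean subtraction, a bin that was zero in both images becomes negative in both, so it contributes positively to the dot product; this is how jointly missing visual words count as evidence of similarity.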

4.2. Improvements 

Weak Geometric Constraints. In image search, geometric
clues among local features have been demonstrated to be
good discriminators against outlier matches [33, 15, 47, 29].
For person re-identification, popular approaches [40, 42, 41]
include the so-called “Adjacency Constrained Search” (ACS).
In this method, a patch in the probe is matched with patches
in a gallery image which are located in a horizontal stripe
positioned at a similar height to the probe patch. The
minimum distance is taken as the similarity score. A similar
idea also appears in DeepReID [23].

ACS is effective in incorporating spatial constraints, but,
as will be shown in the experiments, it suffers from high
computational cost. Inspired by Spatial Pyramid Matching,
we integrate ACS into the BoW model. As illustrated
in Fig. 4, an input image is partitioned into $M$ horizontal
stripes. Then, for stripe $m$, the visual word histogram
is represented as $d^m = (d^m_1, d^m_2, \ldots, d^m_k)^T$,
where $k$ is the codebook size. Consequently, the feature
vector for the input image is denoted as
$f = (d^1, d^2, \ldots, d^M)^T$, which is
the concatenation of vectors from all stripes. When matching
two images, the dot product (Eq. 1) is employed, which
sums up the similarity at all corresponding stripes. In this
manner, we avoid calculating patch distances for each query
feature; instead, the calculation is performed at stripe level.
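
The stripe-level representation can be illustrated with a short sketch. It assumes that each patch is described by its row index in the patch grid and by the visual word indices assigned to it; the function name and this input layout are hypothetical, and plain counts are used in place of the weighted histogram terms.

```python
def stripe_feature(patch_rows, patch_words, num_patch_rows, num_stripes, k):
    """Concatenate one k-bin histogram per horizontal stripe.

    patch_rows[i]  : row index of patch i in the patch grid
    patch_words[i] : visual word indices assigned to patch i
    The result has num_stripes * k entries, laid out stripe by stripe.
    """
    feature = [0.0] * (num_stripes * k)
    for row, words in zip(patch_rows, patch_words):
        # Map the patch row to the stripe that contains it.
        stripe = min(row * num_stripes // num_patch_rows, num_stripes - 1)
        for w in words:
            feature[stripe * k + w] += 1.0
    return feature


if __name__ == "__main__":
    # 4 patch rows, 2 stripes, codebook of 3 words.
    rows = [0, 1, 2, 3]
    words = [[0], [0, 2], [1], [1]]
    print(stripe_feature(rows, words, num_patch_rows=4, num_stripes=2, k=3))
```

Because the stripes occupy disjoint index ranges of the concatenated vector, a single dot product between two such vectors compares stripe 1 with stripe 1, stripe 2 with stripe 2, and so on, which is what replaces the per-patch search of ACS.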
Background Suppression. The negative impact of background
distraction has been studied extensively [6, 40, 41].
In one solution, Farenzena et al. propose to separate
the foreground pedestrian from background using
segmentation. In the following works, Zhao et al. [40, 41]
use the resulting masks to filter out background.

Since the process of generating a mask for each image
is both time-consuming and unstable, this paper proposes a
simple solution by exerting a 2-D Gaussian template on the
image. Specifically, the Gaussian function takes on the form
of $N(\mu_x, \sigma_x, \mu_y, \sigma_y)$, where $\mu_x$,
$\mu_y$ are horizontal and vertical Gaussian mean values, and
$\sigma_x$, $\sigma_y$ are horizontal and vertical Gaussian
standard variances, respectively. We set $(\mu_x, \mu_y)$ to
the image center, and set $(\sigma_x, \sigma_y) = (1, 1)$ for
all experiments. This method injects prior knowledge on
the position of the pedestrian, which assumes that the person
lies in the center and is surrounded by background.
Therefore, the Gaussian mask works by suppressing the
response near the edge of the image.
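
One plausible reading of the Gaussian template is sketched below. The text does not state the coordinate units in which $(\sigma_x, \sigma_y) = (1, 1)$ is expressed, so this sketch assumes that both image axes are rescaled to $[-1, 1]$ with the mean at the center; that scaling, the unnormalized peak value of 1, and the function name are all assumptions.

```python
import math


def gaussian_mask(height, width, sigma_x=1.0, sigma_y=1.0):
    """2-D Gaussian weights centered on the image.

    Assumption: pixel (or patch) centers are mapped to [-1, 1] along each
    axis, so sigma = 1 spans half the image extent.
    """
    mask = []
    for r in range(height):
        y = 2.0 * (r + 0.5) / height - 1.0
        row = []
        for c in range(width):
            x = 2.0 * (c + 0.5) / width - 1.0
            row.append(math.exp(-0.5 * ((x / sigma_x) ** 2 + (y / sigma_y) ** 2)))
        mask.append(row)
    return mask


if __name__ == "__main__":
    # One weight per 4x4 patch of a 128x64 image gives a 32x16 grid.
    mask = gaussian_mask(32, 16)
    print(round(mask[16][8], 3), round(mask[0][0], 3))
```

In a pipeline like the one described, each weight could multiply the histogram contribution of the corresponding patch, so that features near the image border count less than those near the center.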

Multiple Queries. The usage of multiple queries is shown
to yield superior results in image search and
re-identification. If we want to identify a person, it is
worthwhile to use multiple query bounding boxes and
reformulate the query image. In this manner, the intra-class
variation is taken into account, and the algorithm becomes
more robust to pedestrian variations.

When each identity has multiple query images in a single
camera, instead of a multi-multi matching strategy,
we merge them into a single query for speed.
Here, we employ two pooling strategies, i.e., average and
max pooling. In average pooling, the feature vectors of
multiple queries are pooled into one by averaging; in max
pooling, the final feature vector takes the maximum value
in each dimension from all queries.
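
The two pooling strategies reduce to element-wise operations over the query feature vectors, as in the following minimal sketch (the function names are illustrative):

```python
def average_pool(vectors):
    """Element-wise mean of several query feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def max_pool(vectors):
    """Element-wise maximum of several query feature vectors."""
    return [max(column) for column in zip(*vectors)]


if __name__ == "__main__":
    queries = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
    print(average_pool(queries))
    print(max_pool(queries))
```

Either way, the gallery is searched once with the pooled vector, so the cost of a multi-query search stays the same as that of a single query.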

Automatic Reranking. When viewing person
re-identification as a ranking problem, a natural idea is
to use reranking algorithms. In this paper, we use
a simple reranking method which picks the top-$T$ ranked
images of the initial rank list as queries to search the
gallery again. Specifically, given an initial sorted rank
list by query $Q$, image $R_i$, which is the $i$-th image in
the list, is used as a query. The similarity score of a
gallery image $G$ when using $R_i$ as query is denoted as
$S(R_i, G)$. We assign a weight $1/(i+1)$, $i = 1, \ldots, T$,
to each top-$i$ ranked query, where $T$
is the number of expanded queries. Then, the final score of
the gallery image $G$ to query $Q$ is determined as

$$\hat{S}(Q, G) = S(Q, G) + \sum_{i=1}^{T} \frac{1}{i+1} S(R_i, G), \quad (2)$$

where $\hat{S}(Q, G)$ is the weighted sum of similarity
scores obtained by the original and expanded queries, and the
weight gets smaller as the expanded query is located away
from the top. This method departs from the one proposed in
[33] in that Eq. 2 employs the similarity values while [33]
uses the reverse ranks.
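
Eq. 2 can be written out as a short routine. This is a sketch under the assumption that the similarity function is the dot product of Eq. 1 and that the gallery fits in a Python list; the names `rerank` and `dot` are hypothetical.

```python
def dot(a, b):
    """Dot-product similarity, as in Eq. 1."""
    return sum(x * y for x, y in zip(a, b))


def rerank(query, gallery, sim=dot, top_t=1):
    """Final scores of Eq. 2 for every gallery vector.

    The top-T gallery items of the initial list act as expanded queries;
    the i-th one (i = 1..T) is weighted by 1 / (i + 1).
    """
    initial = [sim(query, g) for g in gallery]
    order = sorted(range(len(gallery)), key=lambda j: initial[j], reverse=True)
    final = list(initial)
    for i, idx in enumerate(order[:top_t], start=1):
        weight = 1.0 / (i + 1)
        for j, g in enumerate(gallery):
            final[j] += weight * sim(gallery[idx], g)
    return final


if __name__ == "__main__":
    query = [1.0, 0.0]
    gallery = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
    print(rerank(query, gallery, top_t=1))
```

With `top_t=0` the function returns the initial scores unchanged, which corresponds to running the system without reranking.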

5. Experiments 
5.1. Datasets 

This paper experiments on three datasets, i.e.,
Market-1501, VIPeR [8], and CUHK03 [23]. The latter two
datasets are described below.

VIPeR. This dataset is composed of 632 identities, and
each has two images captured from two different cameras.
Pedestrians in this dataset undergo large variances in
viewpoint, illumination, pose, etc. All images are normalized
to 128x48 pixels. VIPeR is randomly divided into two equal
halves, one for training, and the other for testing. Each half
contains 316 identities. For each identity, we take an
image from one camera as query, and perform cross-camera
search.

CUHK03. This dataset contains 13,164 DPM bounding
boxes of 1,467 identities. Each identity is observed by two
cameras and has 4.8 images on average for each view.
Following the protocol in [23], for the test set, we randomly
select 100 persons. For each person, all the DPM bounding
boxes are taken as queries in turn, and a cross-camera search
is performed. The test process is repeated 20 times, and
stable statistics are reported. We report both the CMC scores
and mAP for the VIPeR and CUHK03 datasets.

| k       | 100   | 200   | 350   | 500   |
|---------|-------|-------|-------|-------|
| mAP (%) | 13.31 | 14.01 | 14.09 | 13.82 |
| r=1 (%) | 32.20 | 34.24 | 34.40 | 34.14 |

Table 2. Impact of codebook size on Market-1501 dataset. We
report results obtained by “BoW + Geo + Gauss”.

| M       | 1     | 4     | 8     | 16    | 32    |
|---------|-------|-------|-------|-------|-------|
| mAP (%) | 5.23  | 11.01 | 13.26 | 14.09 | 13.79 |
| r=1 (%) | 14.36 | 27.53 | 32.50 | 34.40 | 34.58 |

Table 3. Impact of number of horizontal stripes on Market-1501
dataset. We report results obtained by “BoW + Geo + Gauss”.

| T       | 0     | 1     | 2     | 3     | 4     | 5     |
|---------|-------|-------|-------|-------|-------|-------|
| mAP (%) | 18.53 | 19.18 | 19.07 | 18.97 | 19.01 | 18.91 |

Table 4. Impact of number of expanded queries on Market-1501
dataset. T = 0 corresponds to the “BoW + Geo + Gauss +
MultiQ_max” mode.

5.2. Important Parameters 

Codebook size k. In our experiment, codebooks of various
sizes are constructed, and mAP on the Market-1501 dataset is
presented in Table 2. We can see that the peak value is
achieved when k = 350. In the following experiments, this
value is kept.

Number of stripes M. Table 3 presents the performance
of different numbers of stripes. As the stripe number
increases, a finer partition of the pedestrian image leads to
a more discriminative representation. So the recognition
accuracy increases, but recall may drop for a large M. For
example, M = 32 produces a higher rank-1 matching rate
but lower mAP than M = 16. As a trade-off between speed
and accuracy, we choose to split an image into 16 stripes in
our experiment.

Number of expanded queries T. Table 4 summarizes the
results obtained by different numbers of expanded queries.
We find that the best performance is achieved when T = 1.
When T increases, mAP drops slowly, which validates the
robustness to T. The performance of reranking highly
depends on the quality of the initial list, and a larger T
would introduce more noise. In the following, we set T to 1.

Figure 5. Performance of different method combinations on
(a) VIPeR and (b) CUHK03 datasets.

5.3. Evaluation 

BoW model and its improvements. On three datasets, we
present results obtained by the BoW representation,
geometric constraints, Gaussian mask, multiple queries, and
reranking in Table 5 and Fig. 5. Five major conclusions
can be drawn.

First, we find that the baseline BoW vector produces
a relatively low accuracy: rank-1 matching rate = 9.04%,
10.56%, and 5.35% on Market-1501, VIPeR, and CUHK03
datasets, respectively.

Second, when we integrate geometric constraints by
stripe matching, we observe consistent improvement in
recognition accuracy. On the Market-1501 dataset, for
example, mAP increases from 3.26% to 8.46% (+5.20%),
and an even larger improvement can be seen in the rank-1
matching rate, from 9.04% to 21.23% (+12.19%). On the
VIPeR dataset, the rank-1 matching rate rises from 9.95%
to 15.35% (+5.40%). The improvement on CUHK03 is very
similar. Our results are consistent with previous studies
[40, 23] in that when narrowing the matching scope, matching
precision can be improved.

Third, it is clear that the Gaussian mask works well on
all three datasets. Specifically, we observe +5.63% in mAP,
+4.24% in rank-1 matching rate, and +5.74% in rank-1
matching rate on Market-1501, VIPeR, and CUHK03 datasets,
respectively. Therefore, the prior that the pedestrian is
located more or less in the center of the bounding box is
statistically sound. Previous methods address this issue by
isolating foreground from background, a method which may be
influenced by complex background. Another possible direction
consists in modeling the background in video. At this
point, we plan to release the video together with the
bounding boxes, so that a more precise foreground
segmentation result can be produced.

| Methods                                  | Market-1501 r=1 | Market-1501 mAP | VIPeR r=1 | VIPeR r=20 | VIPeR mAP | CUHK03 r=1 | CUHK03 mAP |
|------------------------------------------|-----------------|-----------------|-----------|------------|-----------|------------|------------|
| BoW                                      | 9.04            | 3.26            | 7.82      | 39.34      | 11.44     | 11.47      | 11.49      |
| BoW + Geo                                | 21.23           | 8.46            | 15.47     | 51.49      | 19.85     | 16.13      | 15.12      |
| BoW + Geo + Gauss                        | 34.40           | 14.09           | 21.74     | 60.85      | 26.55     | 18.89      | 17.42      |
| BoW + Geo + Gauss + MultiQ_avg           | 41.66           | 17.63           | -         | -          | -         | 22.35      | 20.48      |
| BoW + Geo + Gauss + MultiQ_max           | 42.14           | 18.53           | -         | -          | -         | 22.95      | 20.33      |
| BoW + Geo + Gauss + MultiQ_max + Rerank  | 42.14           | 19.20           | -         | -          | -         | 22.95      | 22.70      |

Table 5. Results (rank-1, rank-20 matching rate, and mean Average Precision (mAP)) on three datasets by combining different methods, i.e.,
the BoW model (BoW), Weak Geometric Constraints (Geo), Background Suppression (Gauss), Multiple Queries by average (MultiQ_avg)
and max pooling (MultiQ_max), and reranking (Rerank). Note that here we use the Color Names descriptor for BoW.

Figure 6. Comparison with the state-of-the-art methods on VIPeR
dataset. For our method, we combine HS and CN features, as well
as the eSDC method.

| Stage                | SDALF   | SDC    | Ours |
|----------------------|---------|--------|------|
| Feat. Extraction (s) | 2.92    | 0.76   | 0.62 |
| Search (s)           | 2644.80 | 437.97 | 0.98 |
| Rerank (s)           | -       | -      | 0.98 |

Table 6. Average query time of different steps on Market-1501
dataset. Our method achieves significant speedup.

Fourth, we test multiple queries on the CUHK03 and
Market-1501 datasets, where each query identity has multiple
bounding boxes. From the results, we can see that the
usage of multiple queries further improves recognition
accuracy. The improvement is more prominent on the Market-1501
dataset, where the query images take on more diverse
appearance (see Fig. 3). Moreover, we notice that
multi-query by max pooling is slightly superior to average
pooling, probably because max pooling gives more weight to
the rare but salient features and improves recall.

Finally, we observe from Table 4 and Table 5 that the
reranking process generates higher mAP. Nevertheless, one
recurrent problem with reranking is the sensitivity to the
quality of the initial rank list. On the Market-1501 and
CUHK03 datasets, since a majority of queries do not have a
top-1 match, the improvement in mAP is relatively small. For
poor initial ranks, reranking would generate inferior results.
Therefore, algorithms that produce higher accuracy may
benefit more from reranking. Another tentative solution is
to involve human interaction in the loop.

| Methods          | CUHK03 r=1 | Market-1501 r=1 | Market-1501 mAP |
|------------------|------------|-----------------|-----------------|
| SDALF [6]        | 4.87       | 20.53           | 8.20            |
| ITML             | 5.14       | -               | -               |
| LMNN             | 6.25       | -               | -               |
| eSDC [41]        | 7.68       | 33.54           | 13.54           |
| RANK [28]        | 8.52       | -               | -               |
| LDM              | 10.92      | -               | -               |
| KISSME [19]      | 11.70      | -               | -               |
| FPNN [23]        | 19.89      | -               | -               |
| Ours (no MultiQ) | 18.89      | 34.40           | 14.09           |
| Ours (MultiQ)    | 22.95      | 42.14           | 19.20           |
| Ours (+HS)       | 24.33      | 47.25           | 21.88           |

Table 7. Comparison with the state-of-the-art methods on
CUHK03 and Market-1501 datasets.

Timings. When the gallery gets scaled up (consider a city- 
scale re-identification system for an example), a fast re¬ 
sponse time is desirable. To evaluate this property, we com¬ 
pare our method with SDALF O and SDC IdTIl in two as¬ 
pects, i.e., feature extraction and search time. 

[Figure 7: four rows of ranked results. BoW: AP = 0.47%; +Geo+Gauss: AP = 10.12%; +MultiQ_max: AP = 38.56%; +Rerank: AP = 70.03%.]

Figure 7. Sample re-identification results on the Market-1501 dataset. The four rows correspond to four configurations, i.e., “BoW”, “BoW + Geo + Gauss”, “BoW + Geo + Gauss + MultiQ”, and “BoW + Geo + Gauss + MultiQ + Rerank”. The original query is in a blue bounding box, and the added multiple queries are in yellow. Images with the same identity as the query are in green boxes; all others are in red.

Our evaluation is performed on a server with a 3.46
GHz CPU and 128 GB memory, and efficiency results
are shown in Table 6. For our method, we report the total
timing of the HS feature (we extract a 20-dim HS histogram
and generate another BoW vector for fusion with CN) and
the CN feature. Compared with SDC, we achieve an efficiency
improvement of over two orders of magnitude. In the SDALF
method, three features are employed, i.e., MSCR, wHSV,
and RHSP. The feature extraction time is 0.09s, 0.03s, and
2.79s, respectively; the search time is 2643.94s, 0.66s, and
0.20s, respectively. Therefore, our method is faster than
SDALF by three orders of magnitude. From these results,
we can see that a major efficiency gain is achieved.
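
To make the "orders of magnitude" statement concrete, the sketch below totals the SDALF timings quoted in the text and expresses a speed-up as a base-10 logarithm. It is only an illustration under stated assumptions: summing the three per-feature timings into one SDALF total is our reading of the text, and the helper name `orders_of_magnitude` is hypothetical. No timing for our own method is assumed here; those values are given in Table 6.

```python
import math

# SDALF timings in seconds, in the order the text lists its three features.
SDALF_EXTRACTION_S = (0.09, 0.03, 2.79)
SDALF_SEARCH_S = (2643.94, 0.66, 0.20)


def orders_of_magnitude(baseline_s: float, ours_s: float) -> float:
    """Speed-up of `ours_s` over `baseline_s`, expressed as log10 of the ratio."""
    if baseline_s <= 0 or ours_s <= 0:
        raise ValueError("timings must be positive")
    return math.log10(baseline_s / ours_s)


# Assumption: the cost of SDALF is the sum over its three features.
sdalf_extraction_total = sum(SDALF_EXTRACTION_S)  # about 2.91 s
sdalf_search_total = sum(SDALF_SEARCH_S)          # about 2644.80 s

# Largest search time that is still three orders of magnitude faster.
three_order_bound_s = sdalf_search_total / 10 ** 3
```

Under this reading, a search time at or below roughly 2.64 s corresponds to a gain of three orders of magnitude over SDALF, and the total is dominated by the first of the three features.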

Comparison with the state of the art. We compare our
results with the state-of-the-art methods in Fig. 6 and Table
7. On the VIPeR dataset (Fig. 6), we first compare with
unsupervised methods, e.g., eSDC [41] and SDALF [8]. Our
approach is superior to both methods. Specifically,
we achieve a rank-1 identification rate of 26.08%
when two features are used, i.e., Color Names (CN) and HS
Histogram (HS). When eSDC [41] is further integrated, the
matching rate increases to 32.15%, a competitive accuracy
among all competing methods.

Moreover, on the CUHK03 dataset, our method without
multiple-query (MultiQ) significantly outperforms almost
all presented approaches. Compared with FPNN [23], which
builds a deep learning architecture, our rank-1 accuracy is
lower by 1.00%. But when multiple queries and the HS
feature are integrated, our rank-1 matching rate exceeds that
of [23] by 4.44% on the CUHK03 dataset. On the Market-1501
dataset, our results are consistently higher than those of
SDALF [8], eSDC [41], and KISSME [19].

Some sample results on the Market-1501 dataset are
provided in Fig. 7. Apart from the mAP increase as the
method evolves, another notable finding is that the
distractors detected by DPM on complex backgrounds or
body parts severely affect re-identification accuracy.
Previous works typically focus on “good” bounding
boxes that contain the person only, and rarely study detector errors.
In this sense, the Market-1501 dataset provides a more
realistic environment for such evaluation.

6. Conclusions and Insights 

This paper focuses on person re-identification. Overall,
two major contributions are made. The first contribution
consists in bridging the gap between person
re-identification and BoW based image search. Specifically,
the bag-of-words model is applied with extensive improvements
that take into account the spatial constraints and the
multi-query, multi-groundtruth information. Second, a new person
re-identification dataset, the Market-1501, is introduced.
This dataset gets closer to realistic settings and, once
released, is one of the largest datasets in this field.
Bounding boxes in the Market-1501 dataset are detected by DPM.
Apart from annotated pedestrian images, we also include a
number of false positive detection results, and view them as
distractor or junk images.

The BoW representation, though unsupervised, achieves
competitive results on three datasets, while speeding up the
search process by over two orders of magnitude. Both are
desirable properties in industrial usage and provide new
perspectives in the field of person re-identification. We
speculate that this model can be further improved in several
directions. First, supervised approaches can be readily
incorporated, e.g., RankSVM, metric learning, etc., so that the
global vector is more discriminatively weighted. This idea
also works for multi-feature fusion, where descriptors such 
as VLAD 03, CNN ESI, can be effectively combined. 
Second, our experiment highlights the importance of
geometric constraints and foreground estimation. In fact, the
geometric cues can be more elaborately encoded [39, 47];
the root and parts detected by DPM can also be
incorporated. Moreover, the foreground can be more precisely
located by modeling background statistics through video
analysis. Finally, the strength of multiple queries can be
further explored by SVM [32] or spatial verification [29]. The
Market-1501 dataset will be a useful benchmark enabling
these research possibilities.